INTRODUCTION


The PLATO Air Force Base Computer-Based Education (PLATO AFB CBE) 
project at Chanute adopted the mastery learning technique in their 34 
lessons and set the mastery criterion at 80% correct on the end of 
lesson test. They used the performance result of each criterion- 
referenced test (CRT) in two different ways: (1) for assessing the 
individual performance, and (2) for evaluation, or more precisely within 
Chanute's context, lesson evaluation. 

The adoption of a criterion-referenced testing approach to evaluation
raises two measurement issues that have relatively less importance in
norm-referenced testing. The issues are (1) the definition of mastery, and
(2) a priori standards. These issues remain unsolved, but are receiving
increasing attention. A large number of articles on this subject
have been published, but the many definitions of mastery are by no means
equivalent. The concerns of these articles are limited to the use of
criterion-referenced testing for individual assessment, i.e., judging
whether or not a given student has mastered the material to be
learned to some suitable level of mastery (Block, 1971; Emrick, 1971;
Hillman, 1973; Besel, 1971; Novick & Lewis, 1974; Roudabush, 1974; Huynh,
1976; Linn, 1977).

One purpose of this paper is to examine the appropriateness of the use
of CRTs as a means of controlling an individual student's advancement to
the next level of instruction or retention in the current unit of
instruction in the PLATO AFB CBE Program (or project) at Chanute.

Our other purpose in this paper is to turn the focus from the aspect
of individual assessment to that of program evaluation, which requires the
establishment of a criterion rate for validation of a lesson, so that a
lesson would be considered validated if the failure rate at
the end of the lesson was less than the criterion.

Although there is a mathematical duality in both aspects of criterion- 
referenced testing, it is true that the program evaluation aspect has not 
received all the attention that it deserves. One reason for this is that 
the results of evaluation may call for expensive revisions in instructional 
materials, at least in traditional teaching settings. However, PLATO 
provides an ideal situation for program evaluation because revision of 
lessons can be done with relatively little trouble and expense. 

Therefore, it is important and necessary to explore reliable methods 
that will help to improve the quality of CAI lessons. 

CRITERION-REFERENCED TEST AS ASSESSMENT OF PROGRAM EVALUATION

2.1 Mastery Learning Strategy

Mastery learning strategies have been used in many educational
settings since Bloom (1968) advocated them in the late 1960's. In this
approach to instruction, a mastery level is set for the material to be
learned so that a majority of the students must attain the criterion level.

Interesting findings about mastery learning strategies were reported
by Carroll (1963), Atkinson (1968), Block (1970), Kim, Hogan, et al.
(1970, 1971), and many others. According to Block (1971), mastery
learning allowed 75-90% of students to achieve the same level as the top
25% of students usually achieve with typical group instructional
methods such as those in regular classrooms.

A similar study by Kim et al. (1970, 1971) showed that 72% of
approximately 5800 students in foreign language classes achieved a mastery
criterion of 80% correct on final tests under the mastery condition, while
only 28% of students in the traditional condition achieved this level. The high
percentage of students achieving criterion in the mastery condition shows
the effectiveness of this strategy of instruction. However, these results
may also be due partly to the quality of lessons given to the students
during the experiment, or may even be due to the kinds of tests that were
given to the students in order to examine the degree of mastery achieved in
the instructional unit to be learned. We may be able to say that
high quality lessons produce a higher percentage of success than do low quality
lessons if the tests given at the end of the lessons are comparable to one
another.


The experienced instructional designer might say that the quality of 
instruction may be determined by the appropriateness of instructional cues 
and the quality and types of reinforcement given each student, as well as 
the amount of participation and practice experienced by each student. 
Therefore, determining the quality of instruction is a multidimensional and 
complicated task. It is very difficult to measure these factors and develop 
a method of setting validation criteria for CAI lessons based on the 
quantitative data from such complex variables. Since our concern is
restricted to the quantitative method of setting the
validation criterion of a given lesson, we will start by examining the
validation criterion that has been used in the Army and in the PLATO AFB CBE
Program at Chanute Air Force Base.

2.2 Validation Criterion of Lessons in PLATO AFB CBE Program 

The PLATO IV computer-based education system, in development for over
a decade at the University of Illinois, was used in the training program of
Special and General Purpose Vehicle Repairmen at Chanute Air Force Base
(Dallman, 1977). The 37 CAI lessons in the program, comprising almost 30
hours of instruction and 37 tests, are implemented on the PLATO system
along with a routing program that provides individualized instructional
management. The 37 lessons are homogeneous in subject matter and
tutorial in style for the most part. They are arranged in mastery
learning fashion, so that students must achieve the mastery level on the
test given at the end of each lesson in order to be advanced

Table 1

Summary of Master Validation Exams in the Chanute PLATO AFB CBE Project

Lesson  M(a)  Validation  Size of tested  % of     % of     Total  # of
              Date        out sample      Success  Failure  N      Success
------  ----  ----------  --------------  -------  -------  -----  -------
103     30    10 June      63             89%      11%       93     83
104a    30    14 April    114             94%       6%      144    134
104b    30    14 April    113             86%      14%      143    124
105     30    14 April    102             88%      12%      132    117
106     30    19 June      33             82%      18%       63     54
201a    30    28 May       99             90%      10%      129    116
201b    30    23 May      109             72%      28%      139    105
202a    30    18 Aug       33             82%      18%       63     54
202b    30    28 May       90             98%       2%      120    115
203a    30    28 May       33             97%       3%       63     59
203b    30    13 June      33             94%       6%       63     58
203c    30    18 Aug       33             91%       9%       63     57
204     30    18 Aug       33             94%       6%       63     58
205a    30    15 Jan       33             79%      21%       63     53
205b    30    15 Jan       33             82%      18%       63     54
206a    30    13 June      90             82%      18%      120    101
206b    30    25 June      65             82%      18%       95     80
206c    30    11 April    118             95%       5%      148    139
207     30    15 Aug       33             91%       9%       63     57
301     30    25 June     109             79%      21%      139    113
304     30    25 June      65             82%      18%       95     80
305     30    18 May      109             96%       4%      139    132
307     30    14 April    130             81%      19%      160    132
308     30    18 May      109             63%      37%      139     96
401     30    17 April    142             83%      17%      172    146
402     30     8 July      65             79%      21%       95     78
403     30    30 June      65             79%      21%       95     78
404     30     2 Sept      33            100%       0%       63     60

(a) M is the sample size used for establishing validation dates.

(Table 1 cont.)

Lesson  M(a)  Validation  Size of tested  % of     % of     Total  # of
              Date        out sample      Success  Failure  N      Success
------  ----  ----------  --------------  -------  -------  -----  -------
405a    30    26 Aug       33            100%       0%       63     60
405b    30    26 Aug       33             91%       9%       63     57
405c    30    26 Aug       33             94%       6%       63     58
405d    30     2 Sept      33             73%      27%       63     51
406     30    30 June      65             95%       5%       95     89
407     30    22 Sept      33             88%      12%       63     56

to the next lesson. If the mastery level is not achieved, the student
must repeat the lesson. The 37 tests consist mostly of matching and
multiple-choice items. Mastery levels are aimed at the 80% level, but the
cutoffs actually used are somewhere between 75% and 90% of the items
answered correctly. Test lengths vary from 5 to 20 items, and the scores
on the first try of each item are summed to yield the total score of
each test. The tests are called MVE, for Master Validation Exams. For
example, the test at the end of lesson 101 is called MVE101. The
description of the lessons is given in Appendix 2.

A lesson is said to be validated when 90% of the students have
achieved the given criterion level of 75% to 90% of the items answered
correctly on the first attempt on each master validation exam. The sample
consisted of about 30 students from successive classes. No major
modifications of lessons were made until all students in the sample
finished the lessons. All lessons were validated according to this
criterion between April and September of 1975. The exact validation dates
of the lessons are shown in Table 1. In order to validate the
validation criterion, the lessons that were said to be validated were
left unchanged during the evaluation period and were tested on more
students who came in after the validation dates were established.

It is interesting to note that only 15 out of 34* lessons achieved the
criterion level of a 90% success rate at the end of the evaluation period,
although all lessons are labeled "validated." Indeed, this result can be
expected and is not very surprising. The next sections will be devoted to
explaining the reason.

*The number of lessons available for the analysis was reduced from 37 to 34.











2.3 Bayesian-Binomial Model 


By applying a simple binomial model to the first 30 subjects with
whom the validation dates were established, we obtain the result that
the probability of failure to meet the validation criterion upon
follow-up testing is 36.3%. Therefore, 12 out of 34 lessons are predicted to
be failures. Similarly, the posterior distribution of the Bayesian binomial
model, in which a beta distribution was taken as the prior distribution,
predicts a 59.1% failure to meet the validation criterion (this calculation
was done by the PLATO version of CADA developed by Mel Novick). In other
words, 20 out of 34 lessons are predicted to miss the validation
criterion. Table 1 shows that 19 lessons have a failure rate greater
than 10%, which is very close to the number (20) predicted by the
Bayesian binomial model. This fact indicates that it is necessary to
introduce a more accurate validation criterion for lessons. The reader
might wonder how the prior distribution was chosen here. It was based
on the beliefs of the people who participated in the PLATO AFB CBE project.
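
The counts quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are hypothetical, the failure percentages are transcribed from Table 1 (with the value for lesson 202a taken as 100 - 82 = 18), and rounding the expected count to a whole lesson is an assumption about how the figures 12 and 20 were obtained.

```python
# Failure percentages for the 34 lessons, transcribed from Table 1 in lesson
# order (103 ... 407). The entry for lesson 202a is taken as 100 - 82 = 18.
FAILURE_PCT = [
    11, 6, 14, 12, 18, 10, 28, 18, 2, 3, 6, 9, 6, 21, 18, 18, 18,
    5, 9, 21, 18, 4, 19, 37, 17, 21, 21, 0, 0, 9, 6, 27, 5, 12,
]

N_LESSONS = len(FAILURE_PCT)


def predicted_misses(miss_probability: float, n_lessons: int = N_LESSONS) -> int:
    """Expected number of lessons missing the criterion, rounded to a whole lesson."""
    return round(miss_probability * n_lessons)


def observed_misses(failure_pct, criterion_pct: float = 10) -> int:
    """Count lessons whose follow-up failure rate exceeds the criterion."""
    return sum(1 for f in failure_pct if f > criterion_pct)


if __name__ == "__main__":
    print("simple binomial prediction:", predicted_misses(0.363))
    print("Bayesian binomial prediction:", predicted_misses(0.591))
    print("observed in Table 1:", observed_misses(FAILURE_PCT))
```

Under these assumptions the two predictions come out as 12 and 20 lessons, and the count from Table 1 is 19, in agreement with the text.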

Producing a lesson to be used on the PLATO system is not a simple
task. Many steps are involved in the completion of a lesson, including
tryout with students and gathering empirical evidence which might indicate
further revision or modification of the lessons. No unique method for
lesson revision, based on the theories of educational psychology
and educational measurement, has been developed for use on the PLATO
system. As signals pointing to the need for revision, some authors choose
to look at "Area Data," which is collected by the computer, and consists of
elapsed time in the area (a segment of instruction), number of
questions answered correctly on the first try (Okf's), number of
incorrect responses to questions (no's), number of correct responses to
questions (Ok's), and number of helps requested. Others design and
implement their own data collection routines. These data usually give
lesson authors a very rough idea of how well their lessons work with
students and indicate the areas where the majority of students had
trouble.

Thus, it is possible for a PLATO lesson author to have some degree of 
confidence in the quality of his lessons by the time the lesson becomes a 
nearly finished product. The degree of his confidence might depend on his 
knowledge of teaching strategies or his past experience. If he uses 
teaching strategies such as mastery learning, which has been examined by 
many researchers and is known to be highly effective, then it is natural to 
assume that he would be highly confident of the quality of his lesson. If 
an author has substantial experience producing lessons on the PLATO system 
and has used them successfully in his class, then his experience will 
assure him of the success of his new lesson. 

It is reasonable to assume that lessons in which the author has high
confidence are more likely to produce a higher success rate in future
use. Suppose p is the true probability of success
associated with a given lesson; in other words, a proportion p of the
students in a population achieve the mastery level. In general, a Bayesian
density indicates a state of belief about a parameter, such as p here,
intermediate between the estimate "I know nothing about p" and "I know
the exact value of p."
Two types of densities are used, one being the prior density,
representing beliefs about the parameter before observations are obtained,
and the other being the posterior density, representing beliefs after
seeing the data. In our situation, the task is to infer the value of p
from an observation x. It is clear that p obtained in this way cannot be 
exact: that 20 students passed the test out of 25 students is quite a 
probable number for lessons with the value of p anywhere between .65 and 
.90. But the observation that 80% of students achieved the mastery level 
makes p around .8 more likely for the lesson than p around .3, so we should 
estimate p as .8 if nothing else is known about the quality of the lessons. 
If the author has some information about the lesson, such as that since the 
lesson is dealing with a simple introductory task, the value of .8 is 
somewhat lower than it should be, then we would be more inclined to think 
that the true probability of success associated with the lesson is higher 
than .8. If the author has substantial experience in producing high 
quality lessons in past years, then his new lesson would be more likely to 
be considered to have a higher true probability of success than .8, even 
though the observed success rate is .8 in the sample. Therefore, our 
estimate of the true probability p depends not only on the observed value 
x, but also on what we know about p before observing x. 
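
The claim that 20 passes out of 25 is "quite a probable number" for a range of p values can be illustrated by evaluating the binomial probability of that observation at several candidate values of p. This is a minimal sketch using only the standard library; the function name is hypothetical.

```python
from math import comb


def binomial_likelihood(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials with success rate p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


if __name__ == "__main__":
    # 20 of 25 students pass; compare how plausible this is under different p.
    for p in (0.30, 0.65, 0.80, 0.90):
        print(f"p = {p:.2f}: P(x = 20 | n = 25) = {binomial_likelihood(20, 25, p):.4f}")
```

The observation is far more probable under p = .8 than under p = .3, while p = .65 and p = .90 remain plausible, which is why the data alone cannot pin down p exactly.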

The previous knowledge can be expressed by a prior density function
f(p) (also called a prior probability density function). The product
of f(p) and the likelihood function f(x|p) (i.e., the conditional
probability of x given p) gives a quantity proportional to the posterior
density function f(p|x):

$$ f(p \mid x) \propto f(x \mid p)\, f(p) $$

where f(x|p) is called the model density function, rather than the
likelihood, in Bayesian statistics.

The model density is used for inference in traditional statistics, or
sampling theory. Bayesian statistics uses more
information than traditional statistics does, namely the prior density
function. Consequently, Bayesian statistics will provide us with more
accurate information, at least mathematically, than traditional statistics
will if our choice of prior density is appropriate. Indeed, it is
possible to demonstrate such an example, especially if the number of
observations is fairly small. However, the model density, the
conditional probability of x given p, will have the most influence on the
posterior density when the number of observations is large.

A detailed discussion of the Bayesian binomial model can be found
elsewhere (Novick and Jackson, 1975; Ferguson, T., 1971). We will show
only the Bayesian densities in this paper. If we assume the prior belief
about p follows a beta distribution, then the prior density f(p) is given
by a beta density:

$$ f(p) = \frac{p^{a-1}(1-p)^{b-1}}{B(a,b)}, \qquad 0 \le p \le 1,\ a > 0,\ b > 0 $$

the model density f(x|p), normalized as a function of p, is

$$ f(x \mid p) = \frac{p^{x}(1-p)^{N-x}}{B(x+1,\, N-x+1)} $$

and the posterior density f(p|x) is given by

$$ f(p \mid x) = \frac{p^{a+x-1}(1-p)^{b+N-x-1}}{B(a+x,\, b+N-x)} $$

where

$$ B(a,b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)} $$

and N is the number of subjects.

Application of the Bayesian binomial model to the 34 Chanute lessons will
be demonstrated in the next section.
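
The conjugate update described by these densities can be sketched in a few lines: a beta prior with parameters (a, b), combined with x successes among N subjects, yields a beta posterior with parameters (a + x, b + N - x). The prior values a = 26.2 and b = 3.8 used below are an assumption inferred from the posterior B(53.2, 6.8) quoted in the next section together with the 27-of-30 validation sample; they are not stated explicitly in the text. Function names are hypothetical.

```python
from math import sqrt


def beta_update(a: float, b: float, x: int, n: int) -> tuple[float, float]:
    """Posterior beta parameters after observing x successes in n trials."""
    return a + x, b + (n - x)


def beta_summary(a: float, b: float) -> dict[str, float]:
    """Closed-form mean, mode (valid for a, b > 1) and standard deviation of a beta density."""
    total = a + b
    return {
        "mean": a / total,
        "mode": (a - 1) / (total - 2),
        "sd": sqrt(a * b / (total**2 * (total + 1))),
    }


if __name__ == "__main__":
    # Assumed prior (26.2, 3.8); validation sample of 27 successes in 30.
    post_a, post_b = beta_update(26.2, 3.8, x=27, n=30)
    print("posterior parameters:", round(post_a, 1), round(post_b, 1))
    print({k: round(v, 3) for k, v in beta_summary(post_a, post_b).items()})
```

Keeping the update and the summaries separate makes it easy to apply the same two functions to the larger follow-up samples of Table 2.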

2.4 Appropriateness of the Percentage of Success Rate

The rule for establishing validation of a lesson was that 27 of 30
students entering the lesson successively must pass the mastery test
given at the end of the lesson; if this criterion was not met, some
revision of the lesson was carried out. If we consider the 34 lessons
to be homogeneous, as Dallman (1977) stated in his paper, the model
density function derived from a sample of size 30 with 27 successful
attempts predicts, at the time the validation date was established, a
63.7% chance of future success for each lesson.

The corresponding prior density in our situation is obtained from
the validation criterion that has been used in CBE programs in the
Army (Branson et al., 1975): 27 of 30 achieving the criterion level. It was
believed that this rule was adequate to determine the cutoff point for
terminating the process of lesson modification and beginning to gather
data for evaluating the PLATO AFB CBE project at Chanute. The belief
that a 90% rate of success in thirty successive subjects is an adequate
criterion for validating lessons can be thought of as the prior
condition. Therefore, the same beta-binomial distribution function as
the model density function is taken as the prior density distribution in
this case.

Applying Bayes' theorem to the prior and model densities, the posterior
density function is given by the beta-binomial function B(53.2, 6.8) with a
mean of .87 and standard deviation of .04. The 50% credibility interval is
given by [.8714, .9244], in which mode .9 and mean .87 are included.
In Bayesian statistics, the interval [.8714, .9244] is called a 50%
credibility interval for the ability (or success rate) because 50% is
the measure of the strength of our belief, taking into account our prior
knowledge and our observation, that the student's (or lesson's) ability lies
in that interval. In particular, [.87, .92] is a 50% interval between the
25th and 75th percentiles and is called the highest-density region of the
belief, a 50% HDR. The length of the interval, .92 - .87, is called the
interquartile range and is used as a measure of the variability of the
distribution.
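
Tail probabilities such as Prob(π ≥ .9) under a beta posterior have no elementary closed form for non-integer parameters, but they can be approximated numerically. The following is a rough sketch, not the CADA procedure used by the authors: it integrates the B(53.2, 6.8) density with composite Simpson's rule using only the standard library. The function names and the number of integration steps are illustrative choices.

```python
from math import exp, lgamma, log


def beta_pdf(p: float, a: float, b: float) -> float:
    """Beta density at p; returns 0 at the endpoints (valid when a, b > 1)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(p) + (b - 1) * log(1 - p))


def beta_tail(threshold: float, a: float, b: float, steps: int = 2000) -> float:
    """Approximate P(p >= threshold) by composite Simpson's rule on [threshold, 1]."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of sub-intervals
    h = (1.0 - threshold) / steps
    total = beta_pdf(threshold, a, b) + beta_pdf(1.0, a, b)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * beta_pdf(threshold + i * h, a, b)
    return total * h / 3


if __name__ == "__main__":
    a, b = 53.2, 6.8
    print("P(pi >= .90):", round(beta_tail(0.90, a, b), 3))
    print("P(pi >= .85):", round(beta_tail(0.85, a, b), 3))
```

Lowering the threshold from .90 to .85 raises the tail probability sharply because the posterior standard deviation is only about .04.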

As seen in Table 1, we have further observations made after the
validation dates were established. Let us extend our discussion further.

Table 2 summarizes the results of the Bayesian beta-binomial analysis
for each lesson based on the expanded sample and newly observed success
rate. The model density functions of the lessons given in Table 2 were
derived from the new sample of size given in column 8 and number of
successes in column 9 of Table 2. The parameters of the prior density, the
50% HDR, and the probabilities of π being larger than or equal to .9
(Prob(π ≥ .9)) are given in Table 2. From the last column of Table 2 we may
select the lessons whose probabilities of being validated lessons are
greater than .50. Since all standard deviations and interquartile ranges
are small, i.e., mostly less than .05, the probability that π is greater
than or equal to .85 will be drastically greater.

For example, lesson 105 has Prob(π ≥ .85) = .86 while Prob(π ≥ .9) = .25.
Therefore, it is recommended that the validation criterion of 90% be
replaced by a slightly higher value, 92% or so. If we defined the validation
criterion by a slightly higher success rate, say, 28 out of 30 students

Table 2

Credibility Intervals of Master Validation Exams
by Bayesian Binomial Model

Lesson  Observed Score    Mean  Mode  S.D.  a      b     50% CI        P(π ≥ .90)
------  ----------------  ----  ----  ----  -----  ----  ------------  ----------
103      83/93  = .892    .89   .89   .03   109.2  13.8  .8744, .9120  .36
104a    134/144 = .931    .93   .93   .02   133.2  10.8  .9157, .9444  .87
104b    124/143 = .867    .87   .86   .03   123.2  19.8  .8467, .8851  .08
105     117/132 = .886    .89   .88   .03   116.2  15.2  .8665, .9040  .25
106      54/63  = .857    .86   .84   .05    53.2   9.8  .8238, .8842  .10
201a    116/129 = .899    .90   .89   .03   115.2  13.8  .8800, .9160  .43
201b    105/139 = .755    .75   .75   .04   104.2  34.8  .7280, .7774  .00
202a     54/63  = .857    .86   .84   .05    53.2   9.8  .8238, .8842  .10
202b    115/120 = .958    .95   .94   .02   141.2   8.8  .9340, .9588  .97
203a     59/63  = .937    .93   .92   .03    85.2   7.8  .9052, .9425  .74
203b     58/63  = .921    .92   .91   .04    57.2   5.8  .8959, .9425  .63
203c     57/63  = .905    .90   .89   .03    83.2   9.8  .8811, .9228  .47
204      58/63  = .921    .92   .91   .04    57.2   5.8  .8959, .9425  .63
205a     53/63  = .841    .86   .85   .04    79.2  13.8  .8337, .8826  .08
205b     54/63  = .857    .86   .84   .05    53.2   9.8  .8238, .8842  .10
206a    101/120 = .842    .85   .85   .03   127.2  22.8  .8324, .8716  .97
206b     80/95  = .842    .86   .85   .03   106.2  18.0  .8331, .8758  .05

(Table 2 cont'd)

Lesson  Observed Score    Mean  Mode  S.D.  a      b     50% CI        P(π ≥ .90)
------  ----------------  ----  ----  ----  -----  ----  ------------  ----------
206c    139/148 = .939    .94   .93   .02   138.2   9.8  .9255, .9521  .94
207      57/63  = .905    .90   .89   .03    83.2   9.8  .8811, .9228  .47
301     113/139 = .813    .83   .82   .03   139.2  29.8  .8073, .8466  .00
304      80/95  = .842    .86   .85   .03   106.2  18.8  .8331, .8758  .04
305     132/139 = .950    .94   .94   .02   158.2  10.8  .9282, .9528  .96
307     132/160 = .826    .84   .83   .03   158.2  31.8  .8175, .8538  .00
308      96/139 = .691    .73   .72   .03   122.2  46.8  .7020, .7485  .00
401     146/172 = .849    .86   .85   .03   172.2  29.8  .8380, .872   .00
402      78/95  = .821    .84   .83   .03   104.2  20.8  .8160, .8604  .013
403      78/95  = .821    .84   .83   .03   104.2  20.8  .8160, .8604  .013
404      60/63  = .952    .94   .93   .03    86.2   6.8  .9174, .9522  .84
405a     60/63  = .952    .94   .93   .03    86.2   6.8  .9174, .9522  .84
405b     57/63  = .905    .90   .89   .03    83.2   9.8  .8811, .9228  .47
405c     58/63  = .921    .92   .91   .04    57.2  15.8  .8959, .9425  .63
405d     51/63  = .810    .84   .83   .04    77.2  15.8  .8103, .8622  .02
406      89/95  = .937    .93   .92   .02   115.2   9.8  .9117, .9431  .82
407      56/63  = .889    .89   .88   .04    55.2   7.8  .8595, .9137  .31

achieving the mastery level in a successive sample, then the validation
dates given in Table 1 would be later dates, but the
estimation of the true probability of success would be much improved.

Lesson 201a has a 90% success rate in an observation of 99 students who
entered the lesson after the validation date, May 28th. This observed
success rate is the same as the validation criterion. It is interesting
to note that the 50% HDR [.88, .916] of the new prior density based on
the sample size of 129 is slightly narrower than that of size 30, [.8714,
.9244]. In general, when the number of students increases, the 50% HDR
gets narrower. Also note that the value in the last column
of Table 2 for lesson 201a is .43, which is larger than Prob(π ≥ .9) =
.409 when the sample size is 30. Therefore, our credibility in saying
that lesson 201a will have a success rate of 90% in the population from
which this sample was drawn will increase if the sample size on which
the model density was based increases.
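
The narrowing of the HDR with sample size can be seen directly from the closed-form standard deviation of a beta density. The sketch below compares the B(53.2, 6.8) density from the validation sample with the B(115.2, 13.8) density listed for lesson 201a in Table 2; the function name is hypothetical, and the standard deviation is used here only as a convenient proxy for interval width.

```python
from math import sqrt


def beta_sd(a: float, b: float) -> float:
    """Standard deviation of a beta density with parameters a and b."""
    total = a + b
    return sqrt(a * b / (total**2 * (total + 1)))


if __name__ == "__main__":
    print("validation sample, B(53.2, 6.8):  sd =", round(beta_sd(53.2, 6.8), 3))
    print("lesson 201a, B(115.2, 13.8):      sd =", round(beta_sd(115.2, 13.8), 3))
```

The larger sample roughly halves the parameter sum's reciprocal and shrinks the spread from about .04 to about .03, consistent with the narrower interval reported for lesson 201a.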

Hence, setting the most appropriate validation criterion for a lesson
depends on two factors: success rate and sample size. The discussion of
these two factors can be carried out in a mathematically parallel, in other
words mathematically dual, way, taking the sample size as the number of
items (the test length) and the success rate as the proportion of items
answered correctly. In the next chapter, we will switch the focus from the
former, which is oriented toward the success rate of a lesson, to the
latter, which concerns the success rate of an individual on a test.

CRITERION-REFERENCED TEST AS ASSESSMENT OF STUDENTS' PERFORMANCE


3.1 Problems in Criterion-Referenced Tests 


Criterion-referenced testing has gained much attention from
educational measurement and testing specialists in recent years. The
object of criterion-referenced testing is not to distinguish finely
among subjects, but to classify subjects into mastery and non-mastery
groups. Robert Glaser (1963) stated that the measures of CRTs depend on
an absolute standard of quality while those of norm-referenced tests
(NRTs) depend on a relative standard. CRTs are often used in conjunction
with instructional programs that maximize the number of students attaining
a given mastery level and minimize the variability of test scores, while
NRTs are used in selecting or screening a subgroup of examinees,
predicting students' future performance, and evaluating
instructional programs.

The concepts of criterion-referenced testing are quite different
from those of norm-referenced testing. Strictly speaking, the test
scores of an NRT are assumed to be distributed normally while those of a
CRT are highly skewed. The variability in scores of an NRT is large
while that of a CRT is small. These differences are generally
expected but need not be observed in practice. Statistical measures in
the classical test theory model, such as reliability and validity, are
defined on the basis of assuming that the standard deviation of any NRT
is always positive and adequately large. Therefore, the definition of
reliability as the ratio of true score variance to observed score
variance can be a meaningful index there. The reliability tends to
increase as the test length (number of items) increases and hence the
variability of test scores increases. The test length of a CRT is
usually short, say 10 or 15 items, and often most items of a test are
answered correctly by all students who take the test. Therefore the
reliability of a CRT cannot be satisfactorily large. As far as this
author knows, many tests have an α21 reliability of only about 0.5 or
less.

Since it is a common use of criterion-referenced testing that all 
students are expected to achieve the level of mastery, say 90% correct, the 
observed scores become a bounded variable. If there are subjects with true 
scores near the "ceiling" or the "floor", it becomes implausible to assume 
that the errors of measurement are distributed independently of true scores 
for those near the boundary. NRTs don't usually have ceiling or floor 
effects. Their scores are distributed around the mean score and are 
seldom near either extreme. In such a test, it is reasonable to assume 
that error scores are due to something independent of the subject's true 
abilities, such as fatigue, anxiety, etc. 

Lord and Novick (1968) discuss the plausible distributional forms
of observed CRT scores and true scores in Chapter 23 of their book,
"Statistical Theories of Mental Test Scores." We will follow their steps
and adopt the binomial error model for CRT scores. The binomial error
model assumes that if each MVE test is aimed at measuring the learning
level of a topic taught in the Vehicle Training Course, then all items in
the test must measure the same task. In other words, all items in a test
have one and only one common factor, with 0-1 scoring. Suppose there is a
pool of items measuring the same task, and taking an item out of the pool
is an independent event, that is, answering the earlier items on the test
does not affect the ability of a student to answer later items correctly.
Then we can formulate the distribution of raw scores x by a binomial
distribution with parameter θ, in which θ is the proportion of items that a
student would answer correctly over the entire pool of items. If T is a
fixed true score and e is an error of measurement, then the raw score x can
be expressed by the sum of the two, x = T + e, and θ is given by

$$ \theta = T/n $$

where n is the number of items in the test. Let h(x|θ) be the binomial
distribution of x at any given true ability level θ; then the conditional
distribution h(x|θ) is given by

$$ h(x \mid \theta) = \binom{n}{x}\, \theta^{x} (1-\theta)^{n-x}, \qquad x = 0, 1, \ldots, n. $$

It is interesting to note that this model does not pay attention to
item differences. The traditional measurement indices such as item
difficulty or the item discrimination index are not the major concern in
the binomial error model. Rather, finding out how accurately a test can
estimate an examinee's pass or fail status with respect to a given
mastery level is the main concern of the model.
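
Under the binomial error model, the chance that an examinee with true proportion θ reaches a given cutoff on an n-item test is a binomial tail sum. The sketch below illustrates this for a 10-item test with a cutoff of 8 items (80% correct); these particular values and the function name are illustrative assumptions, not figures taken from a specific MVE.

```python
from math import comb


def pass_probability(theta: float, n_items: int, cutoff_items: int) -> float:
    """P(raw score >= cutoff_items) when each of n_items is answered
    correctly with probability theta, independently of the others."""
    return sum(
        comb(n_items, x) * theta**x * (1 - theta) ** (n_items - x)
        for x in range(cutoff_items, n_items + 1)
    )


if __name__ == "__main__":
    for theta in (0.7, 0.8, 0.9):
        print(f"theta = {theta:.1f}: P(pass) = {pass_probability(theta, 10, 8):.3f}")
```

An examinee whose true proportion sits exactly at the 80% cutoff passes only about two times in three on a test this short, which shows how test length limits the accuracy of pass or fail decisions.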

Keats and Lord (1962) investigated the relationship between the
distributions of observed test scores and true scores. The observed test
scores could be adequately represented by the negative hypergeometric
distribution h(x), and the true score distribution could be represented
by the two-parameter beta distribution g(θ),

$$ g(\theta) = \frac{\theta^{a-1}(1-\theta)^{b-n}}{B(a,\, b-n+1)} $$

where a > 0 and b > n-1. And also

$$ h(x) = \int_0^1 \frac{\theta^{a-1}(1-\theta)^{b-n}}{B(a,\, b-n+1)} \binom{n}{x}\, \theta^{x}(1-\theta)^{n-x}\, d\theta, \qquad x = 0, 1, \ldots, n. $$


In classical test theory, the estimate of a true score is given by
regressing the true score T on the observed score x, and the equation is
given by

$$ E(T \mid x) = \rho\, x + (1 - \rho)\, \mu_x $$

where ρ is the reliability of the test and μ_x is the mean of the test scores.

In the binomial error model, the estimate of a true score is given by a
similar equation,

$$ E(T \mid x) = \alpha_{21}\, x + (1 - \alpha_{21})\, \mu_x, \qquad x = 0, 1, \ldots, n $$

where α21 is the ratio of number-correct true-score variance to
observed-score variance and is given by

$$ \alpha_{21} = \frac{n}{n-1} \left[ 1 - \frac{\mu_x (n - \mu_x)}{n\, \sigma_x^{2}} \right] $$

where σ_x² is the variance of the observed scores.


Table 3 is the summary of information from the Mastery Validation Exams 
at Chanute. 

Table 3

The Summary of Simple Statistics of Mastery Validation Exams

test     mean    SD     items  α21     N
-------  ------  -----  -----  ------  --
mve103    7.388  1.124    8    0.6321  85
mve104a  11.892  0.442   12    0.4910  83
mve104b  10.120  1.728   11    0.8018  83
mve105    7.706  0.737    8    0.5470  85
mve201a   9.474  0.973   10    0.5254  76
mve201b   8.907  1.325   10    0.4951  86
mve202a  16.186  2.934   20    0.6753  97
mve202b   9.720  0.634   10    0.3573  82
mve204    8.557  1.681   10    0.6253  88
mve205a   6.767  1.558    9    0.3470  90
mve205b   8.110  1.736   10    0.5457  82
mve206a  12.038  1.574   13    0.6942  78
mve206b  15.250  1.619   17    0.4259  80
mve206c  19.257  1.151   20    0.4841  70
mve207    3.761  1.124    5    0.3287  88
mve301    8.727  1.501   10    0.5635  77
mve303   17.380  2.257   20    0.5824  71
mve304    9.209  1.366   10    0.6771  67
mve305    7.458  0.934    8    0.4806  72
mve307   14.683  1.522   16    0.5101  63
mve308    9.037  1.170   10    0.4045  82
mve401    9.254  1.015   10    0.3673  63
mve402   14.138  2.335   17    0.5988  94
mve403    8.095  2.487   10    0.8340  84
mve404    4.254  0.876    5    0.2166  67
mve405a   9.169  1.069   10    0.3701  71
mve405b   8.329  1.991   10    0.7208  70
mve405c   9.087  1.222   10    0.4934  69








In classical test theory, α21 (Kuder-Richardson formula 21) is always smaller
than or equal to the other reliability approximations, such as α20 and
Cronbach's coefficient α. α20 and α21 become equal only when all
items are of equal difficulty (or have equal means if the scores are
dichotomous; note that α20 would be used in place of α21 with a
compound binomial model). Coefficient α becomes equal to α20 if all
items in a test are parallel, that is, all items have the same mean
values and variances in classical test theory. As we previously noted
in this chapter, the binomial error model assumes a single common factor
and is not concerned with differentiating among item characteristics.

The model does not require any information about the item
characteristics in a test, such as difficulty and discriminating index,
but it does require knowledge of the number of items on a test. It is
interesting to note that the mathematically derived ratio of the true
and observed score variances in the model becomes equal to the
reliability of a test in which all items are of equal difficulty and
variance. Therefore the definition of reliability in classical test
theory loses an interesting feature in terms of its traditional sense,
because in the binomial error model the value of the reliability index
is reduced to that of the lowest approximation to the ratio of the true
and observed score variances in classical test theory. Since α21 is a
special case of the reliability approximations when item differences are
ignored, it is exactly what we can expect from the binomial model.

The conceptualization of reliability is no longer important in the
model. Instead, the accuracy of judging the non-mastery and mastery status
of examinees becomes the main concern. Millman states this purpose of CRT
clearly in his paper (1975), and discusses how many items must be
administered from a given item pool so that the proportion of test items in the domain
answered correctly can give an accurate estimate of an examinee's true
ability θ.

Setting of Mastery Levels 

The mastery level of the Master Validation Exams (MVE) of the 37
lessons in the Chanute PLATO AFB CBE Program was set at 80%,
although it is impossible to prove that 80% is the most appropriate level
for the program. Block (1972) showed in his experimental study that
attainment of a 95% mastery level maximized student learning of
cognitive tasks in his matrix algebra course, while an 85% level
maximized learning as characterized by affective criteria.

Chanute's 37 lessons are designed to be "homogeneous" with
respect to content and teaching style: all lessons are written under the
same principle with the same tutorial logic, although the subject matter in
each lesson is different. Chanute's lessons are therefore not linearly
related, and the content difficulty of the lessons is not hierarchically
ordered as it would be in teaching mathematics, arithmetic, or foreign
languages. If lessons are linearly related, the mastery level set
for the earlier instructional units should be higher than that of the
later instructional units. If the goal of the second unit is the
attainment of an 85% mastery level, then the mastery level of the first unit
might be 90%, or some other level higher than 85%. Since there is no
analytical technique to provide the optimal level of mastery learning,
definite statements about the determination of ideal mastery levels cannot
be made at this time. Linn (1978) provides an excellent discussion
of the topic of "setting standards".


Cutoffs 


Mastery levels are usually set by instructors or the author of a
lesson, but the decision of mastery and non-mastery is based on examinees'
observed test scores. The score that is used to decide mastery and
non-mastery is called the "cutoff." Mastery and non-mastery status ought to be
defined on the basis of true ability θ, not observed test scores x, which are
subject to measurement errors. If true ability were known, there would be
no incorrect classifications. Unfortunately, true scores are impossible
to obtain in practice, so we have to find a way to minimize
misclassification.

There are four kinds of classifications: (1) an examinee's true
ability θ and observed score x are both at or above the given mastery level
θ_0 and cutoff score c, that is, A = { x ≥ c and θ ≥ θ_0 }; (2) θ is lower than θ_0
and x is also lower than c, that is, B = { x < c and θ < θ_0 }; (3) θ is lower than
θ_0 but x is at or above c, F+ = { x ≥ c and θ < θ_0 }; (4) θ is at or above θ_0
but x is lower than c, F- = { x < c and θ ≥ θ_0 }. Figure 1 shows
these four conditions.

Figure 1

θ = true ability, x = observed score
θ_0 = true mastery level
c = observed cutoff

The probabilities of these events will be denoted by P(A), P(B), P(F+),
and P(F-), respectively.


Millman (1975), and then Novick & Lewis (1975), reported the percentage of
students expected to be misclassified for a given cutoff with various
numbers of test items. Millman used the binomial error model, while Novick
and Lewis used the Bayesian beta-binomial error model.

According to Millman's calculations, the percentage of students expected
to be misclassified at the 80% mastery level using a 10-item test could be as
high as 53%.
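The kind of calculation involved can be illustrated with the plain binomial error model. This is only a sketch under assumptions that are ours, not Millman's: a 10-item test, an observed cutoff of 8, and two hypothetical students with true abilities of .80 and .70. It is not an attempt to reproduce the 53% figure, which depends on the conditions Millman considered.

```python
from math import comb


def prob_below_cutoff(theta, n_items, cutoff):
    """P(x < cutoff) for a student whose true proportion-correct is theta."""
    return sum(
        comb(n_items, x) * theta ** x * (1 - theta) ** (n_items - x)
        for x in range(cutoff)
    )


# A student exactly at the 80% mastery level who fails a 10-item test (false negative).
p_false_negative = prob_below_cutoff(0.80, 10, 8)

# A student at 70% true ability who nevertheless passes (false positive).
p_false_positive = 1 - prob_below_cutoff(0.70, 10, 8)

print(round(p_false_negative, 4), round(p_false_positive, 4))
```

Under these assumed values roughly one borderline master in three fails, and more than one non-master in three passes, which shows how error-prone a 10-item test is near the cutoff.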

Emrick (1972) and Huynh (1976) considered the loss ratio Q of F- to
F+ as a means of controlling misclassification, especially false
advancement. If later instructional units require the knowledge and skill
acquired in earlier units, false advancement will be a problem.

Since F- stands for the event in which a student has really mastered
the given instructional unit but his/her observed score happens to be
lower than the cutoff, retaining such a student in the same unit is not
efficient. If the instructional units are fairly independent of one
another, as are the lessons in the Vehicle Training Program at Chanute
Air Force Base, then an appropriate loss ratio would be 1, or at least
it is not necessary to set it as high as 10.

Huynh (1976) proposed a way of evaluating the cutoff score that minimizes
the occurrence of misclassifications for a given loss ratio. With his
cutoff score, the loss ratio of a false positive to a false negative
is held fixed, say at 10,
while the linear combination of the probabilities of both events weighted by
the loss ratio (the average loss) is minimized. We will discuss Huynh's
method in more detail in conjunction with the 34 Chanute lessons and their
MVE test scores.

3.2 Evaluation of the optimal cutoff scores

Huynh derived the optimal cutoff c_0 of a test for a given mastery
level θ_0 and loss ratio Q so as to minimize the average loss function R(c)
by differentiating it, where R(c) is the linear combination of the probabilities
of false positive and false negative and is given by

\[ R(c) = P(F_{+}) + Q\, P(F_{-}). \]

c_0 is the smallest integer such that the incomplete beta function
I_{θ_0}(a+c_0, n+b-c_0) is smaller than or equal to Q/(1+Q), where

\[ P(c_0) = I_{\theta_0}(a+c_0,\; n+b-c_0) = \int_0^{\theta_0} \frac{\theta^{a+c_0-1}(1-\theta)^{n+b-c_0-1}}{B(a+c_0,\; n+b-c_0)}\, d\theta . \]

In order to apply Huynh's result to evaluate c_0, we need the help of a
computer to calculate the values of the incomplete beta function for
c = 0, 1, 2, ..., n and plot them on paper. The PLATO system eases these steps, and
we can obtain the answer through the program "cutoff" written by the
present author and T. Weaver. Figure 2 illustrates the procedure to
determine the optimal cutoff c_0. The parameters a and b are obtained
from the mean and standard deviation of the test and the number of items in
the test (denoted by n). Table 4 shows the values of the incomplete beta
function I_{θ_0}(a+i, n+b-i) at each point i = 1, 2, ..., n, where a and b are
calculated from the test scores of MVE201a by the formulas

\[ a = \left(-1 + \frac{1}{\alpha_{21}}\right)\mu_x , \qquad b = -a - n + \frac{n}{\alpha_{21}} . \]

Figure 2

Determining the optimal cutoff c_0 so as to minimize misclassification

lesson = MVE201a   subjects = 76   n = 10
mean = 9.4737   SD = 0.9726   α21 = 0.53
a = 8.5560   b = 0.4753
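The formulas for a and b can be checked against the MVE201a values quoted with Figure 2. The sketch below is an illustration only; the function name is hypothetical, and α21 is computed with the Kuder-Richardson formula 21 from the reported mean, SD, and test length.

```python
def beta_parameters(mean, sd, n_items):
    """Return (alpha21, a, b) of the beta-binomial model from simple test statistics."""
    alpha21 = (n_items / (n_items - 1)) * (1 - mean * (n_items - mean) / (n_items * sd ** 2))
    a = (-1 + 1 / alpha21) * mean
    b = -a - n_items + n_items / alpha21
    return alpha21, a, b


# MVE201a: mean 9.4737, SD 0.9726, n = 10 (values quoted with Figure 2).
alpha21, a, b = beta_parameters(9.4737, 0.9726, 10)
print(round(alpha21, 4), round(a, 4), round(b, 4))
```

With these inputs the sketch returns values that agree with the reported a = 8.5560 and b = 0.4753 to about three decimal places; note that a + b equals n(1/α21 - 1), so a low α21 corresponds to a tightly concentrated true-score distribution.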

Table 4

Ten points in Figure 2

θ_0 = .80, Test = mve201a, a = 8.5560, b = 0.4753

item i     a+i      n+b-i    I_{θ_0}(a+i, n+b-i)
   1      9.556     9.475        0.998
   2     10.556     8.475        0.991
   3     11.556     7.475        0.969
   4     12.556     6.475        0.913
   5     13.556     5.475        0.796
   6     14.556     4.475        0.608
   7     15.556     3.475        0.376
   8     16.556     2.475        0.169
   9     17.556     1.475        0.045
  10     18.556     0.475        0.004

The curve in Figure 2 is obtained by plotting the points in Table 4.

The horizontal lines marked by loss ratios 0.5, 1, 1.5, 2, ..., 20 in
Figure 2 help to evaluate the optimal cutoff c_0, which minimizes the average
loss R(c), for the partially known loss ratio Q and a given true mastery
level θ_0. Since the lessons in the
Chanute PLATO AFB CBE Program deal with independent topics
and are not linearly or hierarchically related, a
loss ratio of 1 is reasonable. In Figure 2 the smallest
integer value of i for which the curve P(i) goes under the line of loss
ratio 1 is 7. Therefore c_0 = 7 is the ideal cutoff score of the
test MVE201a.
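The search for c_0 described above can be sketched in a few lines. This is not the PLATO "cutoff" program; it is an illustrative reimplementation in which the incomplete beta function is approximated by Simpson's rule (an assumption made here to keep the example free of external libraries), and all function names are hypothetical.

```python
from math import exp, lgamma, log


def reg_inc_beta(x, p, q, steps=2000):
    """Regularized incomplete beta I_x(p, q) by composite Simpson's rule.

    Assumes p > 1 and 0 < x < 1, so the integrand is finite on [0, x].
    """
    log_norm = lgamma(p + q) - lgamma(p) - lgamma(q)  # log(1 / B(p, q))

    def density(t):
        if t <= 0.0:
            return 0.0  # t**(p - 1) vanishes at 0 because p > 1
        return exp(log_norm + (p - 1) * log(t) + (q - 1) * log(1 - t))

    h = x / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(x)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * density(k * h)
    return total * h / 3


def optimal_cutoff(a, b, n_items, theta0, loss_ratio):
    """Smallest c with I_theta0(a + c, n + b - c) <= Q / (1 + Q)."""
    threshold = loss_ratio / (1 + loss_ratio)
    for c in range(n_items + 1):
        if reg_inc_beta(theta0, a + c, n_items + b - c) <= threshold:
            return c
    return None  # no score on the test satisfies the condition


# MVE201a parameters from Table 4, mastery level .80, loss ratio 1.
posterior = [reg_inc_beta(0.80, 8.5560 + i, 10 + 0.4753 - i) for i in range(11)]
c0 = optimal_cutoff(8.5560, 0.4753, 10, 0.80, loss_ratio=1.0)
print(c0, [round(p, 3) for p in posterior])
```

Each value in `posterior` is the probability that θ lies below θ_0 given an observed score i, so the rule passes a student as soon as that probability drops to Q/(1+Q) or less; with Q = 1 the threshold is .5.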

It is interesting to note that the cutoff score actually used
for MVE201a in the Chanute training program, c = 8, gives a slightly larger value
of the probability of misclassification, R(c) = P(F+) + P(F-), than the
theoretically derived c_0 does, although this is not so for P(F+), the
probability of false positive, or P(F-), the probability of false negative,
taken separately.


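\[ P(F_{+}) = I_{\theta_0}(a,b) - \frac{1}{B(a,b)} \sum_{i=0}^{c-1} \binom{n}{i} B(a+i,\; n+b-i)\, I_{\theta_0}(a+i,\; n+b-i) \]

\[ P(F_{-}) = \frac{1}{B(a,b)} \sum_{i=0}^{c-1} \binom{n}{i} B(a+i,\; n+b-i)\, \bigl(1 - I_{\theta_0}(a+i,\; n+b-i)\bigr) \]

The probabilities P(A) = Prob{θ ≥ θ_0, x ≥ c} and P(B) = Prob{θ < θ_0, x < c}
are given respectively by the following formulas:

\[ P(A) = 1 - I_{\theta_0}(a,b) + \frac{1}{B(a,b)} \sum_{i=0}^{c-1} \binom{n}{i} B(a+i,\; n+b-i)\, \bigl(I_{\theta_0}(a+i,\; n+b-i) - 1\bigr) \]

\[ P(B) = \frac{1}{B(a,b)} \sum_{i=0}^{c-1} \binom{n}{i} B(a+i,\; n+b-i)\, I_{\theta_0}(a+i,\; n+b-i) \]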
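The four probabilities can be evaluated numerically as follows. This sketch is an assumption-laden illustration rather than the original program: it sums the joint probabilities score by score (which is algebraically equivalent to the complement forms given above), approximates the incomplete beta function by Simpson's rule, and uses hypothetical function names.

```python
from math import comb, exp, lgamma, log


def log_beta(p, q):
    """Logarithm of the complete beta function B(p, q)."""
    return lgamma(p) + lgamma(q) - lgamma(p + q)


def reg_inc_beta(x, p, q, steps=2000):
    """Regularized incomplete beta I_x(p, q) by Simpson's rule (p > 1, 0 < x < 1)."""
    def density(t):
        if t <= 0.0:
            return 0.0
        return exp((p - 1) * log(t) + (q - 1) * log(1 - t) - log_beta(p, q))

    h = x / steps
    total = density(0.0) + density(x)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * density(k * h)
    return total * h / 3


def classification_probabilities(a, b, n_items, theta0, cutoff):
    """Return P(A), P(B), P(F+), P(F-) under the beta-binomial model."""
    probs = {"A": 0.0, "B": 0.0, "F+": 0.0, "F-": 0.0}
    for i in range(n_items + 1):
        # Beta-binomial probability of observing score i.
        h_i = exp(log(comb(n_items, i)) + log_beta(a + i, n_items + b - i) - log_beta(a, b))
        # Probability that theta < theta0 given the observed score i.
        below = reg_inc_beta(theta0, a + i, n_items + b - i)
        if i >= cutoff:
            probs["F+"] += h_i * below
            probs["A"] += h_i * (1 - below)
        else:
            probs["B"] += h_i * below
            probs["F-"] += h_i * (1 - below)
    return probs


def average_loss(a, b, n_items, theta0, cutoff, loss_ratio=1.0):
    """R(c) = P(F+) + Q * P(F-)."""
    p = classification_probabilities(a, b, n_items, theta0, cutoff)
    return p["F+"] + loss_ratio * p["F-"]


# MVE201a: compare the cutoff actually used (8) with the theoretical one (7).
used = classification_probabilities(8.5560, 0.4753, 10, 0.80, 8)
risk = {c: average_loss(8.5560, 0.4753, 10, 0.80, c) for c in (6, 7, 8)}
print(used, risk)
```

Under a loss ratio of 1, the average loss at c = 7 comes out smaller than at either neighbouring cutoff, which is the comparison the text draws between c_0 and the cutoff of 8 used in practice.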

The probabilities of each misclassification for all available MVEs
were calculated and are summarized in Table 5.

Since the sum of the probabilities of A, B, F+, and F- is 1, the sum of
the probabilities of A and B must have a maximum value at c_0, where
P(F+) + P(F-) reaches its minimum, as shown in Figure 3.

In Figure 3, the curve of P(F+) + P(F-) (the lower curve, drawn with *)
decreases slowly until it reaches the bottom at c_0 and then increases as
the cutoff (the number of items required) increases, while the curve of
P(A) + P(B) (the upper curve, drawn with +) reaches its maximum point at c_0.

Figure 3

Graph of P(F+) + P(F-) over cutoff scores
lesson = MVE201a   n = 10   c_0 = 7

Table 5 indicates that the actually used cutoff scores c produce
higher probabilities P(F+ or F-) than the theoretically determined
cutoffs c_0, except in a few cases. Since the theoretical cutoffs are
determined so as to minimize the average loss R(c), which in our case is the sum
of the probabilities of false negative F- and false positive F+, all values
in column 5 of Table 5, P(F+) + P(F-), are smaller for c_0 than for
c. The sum of the probabilities of A and F+ is the expected success rate,
and this sum matches the observed success rate given in the last column
fairly well.

If the c_0 were used as cutoffs for the MVE test scores, 12 lessons would
still have a probability of observed success less than .90, which was used
as the lesson validation criterion in the PLATO AFB CBE program, while
18 lessons have such values of P(A) + P(F+) (i.e., p(x ≥ c)) when the
actual cutoffs c are used.

Since the probability of false negative, P(F-), stands for the case
in which an examinee has really mastered the goal of the instructional unit but his/her
observed score happened to be lower than the cutoff c in use, he/she should not
have to repeat the instruction. If efficiency of training in terms of
shortening the training time is the main concern, then P(F-) should not be
large. For example, MVE207 has P(F-) = .1957, which means that about
88 x 0.1957 = 17 out of a total of 88 students repeated the same
instruction unnecessarily. Of course this is an extreme case, and most
values are less than 10%, which means that five to eight students
repeated the same lesson mistakenly. Table 6 shows the number of
students misclassified in the Master Validation Exams. Since the observed
cutoffs c for all MVEs but MVE207 are larger than or equal to the
optimum cutoff c_0, the number of misclassified students of type F+
becomes larger when using c_0 than when using c, and errors of type F- turn
out to be smaller for c_0. But the total number of misclassifications is
minimized by using c_0. How the cutoff should be selected is a matter of tradeoff.

Since the loss ratio of 1 was selected in our study, we conclude
that most cutoffs of the Master Validation Exams used at Chanute were not
the best choice. By adopting the theoretically derived cutoffs c_0, the
probability of misclassification could have been minimized.

The probabilities of observed success, prob(x ≥ c), or
prob(A or F+), suggest that the validation criterion of lessons in the
Chanute program must be changed. Twelve out of 27 lessons have a
passing probability of less than .90, even if the theoretical cutoff c_0
had been adopted instead of the actually used cutoff score c. Those
lessons which have failed apparently need more attention from the
instructional designers, but at the same time their tests need to be
reviewed too, because we do not know the cause of misclassifications in a
test. The investigation along this line will be taken up in the next section.

It should be noted that the dotted curve in Figure 2 decreases
slowly for the smaller K values (No. of items in a test) but then
drops rapidly until K reaches 9, where it slows down again. The shape of
the curves varies quite a bit among MVEs, and some start dropping rapidly
at around K = 7 or 8 for an 80% true mastery level. Thus, the loss ratios of
8 and 20 can have the same optimal cutoff for the same true mastery
level. This is because the beta-binomial model deals with
continuous scores while the real data are discrete.

VALIDATION OF LESSONS AND CRITERION REFERENCED TESTS


4.1 Predicting the Percentage of Success Rate for the Lesson 

Table 7 shows the estimated probability of success in terms of
the proportion of true score to the number of test items, or true ability
level θ. These calculations are based on the error-free true ability level θ,
so they are more reliable than the values in Table 2,
which were calculated from the observed scores.

Since P(π > .9), the probability of 90% of the examinees achieving
mastery, was based on the observed success rate and sample size, its
values do not reflect the information from the tests, such as test length,
α21, mean, and standard deviation of a test.

However, the probability P(F+ or A) is derived from information specific
to each test; hence we can consider it more accurate than
P(π ≥ .9). The lessons which have values larger than .90 for
P(A or F-) and P(A or F+) might not require any further
revision, but the others might. Lessons 105 and 308 probably
will not require any further revision, but 204, 207, 303, 304, 402, and 405b
might need revision of the lesson or tests in spite of not being recommended
for revision according to the validation criterion that has been used in the Chanute
program. The probability of PASS based on the observed scores tends to
provide larger values, so the validation criterion based on the
probability of true ability level, P(A or F-) (i.e., p(θ ≥ θ_0)), will be a more
plausible standard.

It is important to note that these lessons may not really need
revision; instead, the result may be due to poor test construction. So
far, the only available technique to measure the quality of lessons is
to examine the result of a CRT given at the end of the lesson. If the
test is constructed very poorly (e.g., MVE207, with P(F+ or F-) = .2992 and
α21 = .3287), then it will be unfair to use the measure to question the
quality of the lesson. The measure does not distinguish between the test and the
lesson. Thus, the faulty part may be the test and/or any other part or
parts of the lesson. This argument can also be applied to the reverse
situation. Therefore, construction of a good test will be a key point
in judging the quality of a lesson that will be indirectly measured by
this test.

Table 7

Validation Criteria

Columns: lesson; P(A + F-); P(A + F+); P(π > .90); comment from Dallman et al.

[The numerical entries of this table are illegible in the source copy.]

** recommended revision of lesson or test

4.2 Validation of Mastery Validation Exams 


In the previous chapter, we discussed the optimal cutoff c_0 of a CRT
with respect to the Mastery Validation Exams in the PLATO AFB CBE Program at
Chanute Air Force Base.

The evaluation study of the program, supported by the Advanced Research
Projects Agency, measured some criterion variables which would be
helpful in conducting a validation study of MVEs. The evaluation study
revealed that a substantial number of examinees were misclassified (Table
6). Since detailed information on the design used in the evaluation
study can be found in Dallman et al. (1977), just a brief description
will be given here.

A 50-item NRT was given at the beginning and end of the eight-week
PLATO AFB CBE Program, which included 37 on-line lessons. The 37 lessons
were divided into four subsets called Block1, Block2, Block3, and Block4.
After a student studied and mastered all lessons in a block, he took the
block test; the block test score was counted in his final grade for the
course. He had to take all four block tests, and then a posttest was given
in order to measure the effectiveness of the program. Each block test had
twenty items, which were either multiple choice or matching. The
coefficient alpha reliabilities were not calculated because the tests
were written on the PLATO system and the item information was not
collected. But α21 was available in the following chart. Figure 4
gives a flow chart of the testing program.

In order to validate the effectiveness of lessons, four kinds of 
correlations were calculated. These correlations are described in the 
following paragraphs. 

Each Block's test scores were matched with the corresponding Master 
Validation Exam scores and the time needed to master the lesson (mastery 
time), and their correlations were calculated over the subjects. These 
two correlation values of 27 lessons were denoted by r(B,MVEs) and 
r(B,time) respectively. Their values are shown in Table 8. 

Figure 4

Block diagram of student flow through PLATO-based portion of
Automotive Course

The true gain scores from the pretest, x_1, to the posttest, x_2, were
estimated by a multiple regression procedure; the true score difference
t_2 - t_1 of the observed score difference x_2 - x_1 was regressed on the post-
and pretest scores. It is known that the regression of t_2 - t_1 onto the
two variables x_1 and x_2 is the same as regressing t_2 - t_1 on the score
x_2 - x_1 and the residual score, c_2, of x_2 on x_2 - x_1 (Tatsuoka, 1975),
because the covariance of x_2 - x_1 and c_2 equals zero and both x_2 - x_1
and c_2 are linear combinations of x_1 and x_2.

Therefore, the multiple regression R(t_2 - t_1 | x_2, x_1) will be given
as the sum of the regressions R(t_2 - t_1 | x_2 - x_1) and R(t_2 - t_1 | c_2):

\[ R(t_2 - t_1 \mid x_2, x_1) = R(t_2 - t_1 \mid x_2 - x_1) + R(t_2 - t_1 \mid c_2). \]

Note that the regression coefficient of the first term is the
reliability of gain scores and that of the second term is the increment
of multiple R². The multiple R is .861; hence the reliability of the
multiple regression gain score is .7405. The first term, the simple
difference score, has a reliability of .1047; the second term is
.6358.
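As a quick arithmetic check of the figures just quoted (this only restates the reported values; the labels under the braces are ours):

```latex
\[
R^2 \;=\; \underbrace{.1047}_{\text{simple difference } x_2 - x_1}
      \;+\; \underbrace{.6358}_{\text{increment from } c_2}
      \;=\; .7405 ,
\qquad
R \;=\; \sqrt{.7405} \;\approx\; .861 .
\]
```

Most of the reliability of the regressed gain score therefore comes from the residual term rather than from the simple difference score.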

This estimated gain score has a higher reliability than those of the
pretest and posttest separately. This score was correlated with MVE
scores and mastery time. Table 8 shows the result.

The optimal cutoffs that were evaluated in the previous chapter
were divided by the number of items in the corresponding Master Validation
Exam. The same operation was applied to the difference of the mean from
the observed cutoff c_0 in each MVE. This value expresses the distance
of c_0 from the mean in each test. The summary description of these
variables and the correlation matrix are given in Table 9.

The probability of false positive (or false advancement), P(F+), has
correlation values of -.562, -.659, and .638 with 'nafter', (mean - c_0)/n, and
P(F-) (false negative, or false retainment), respectively. This means that
the misclassification of false advancement tends to occur more often
when the observed cutoff c_0 is closer to the mean. A test which
advances students to the next lesson by mistake more frequently also
tends to retain students whose true scores are really above the
mastery level. The correlation of -.562 with the variable nafter, the number
of students who studied a lesson after the validation date was set (if
over 90% of students pass the mastery level of an MVE, then the lesson
was said to be validated), indicates that the probability P(F+) tends to be
small for lessons whose validation date was established at an
earlier date during the period of the evaluation study of the PLATO program.

This relation also holds for the variables P(F+ or F-) and P(F-), because
the correlations of the variable nafter with them are -.617 and -.544,
respectively. Moreover, P(F+), P(F+ or F-), and P(F-) correlate highly with the
variable (mean - c_0)/n, with the values of -.659, -.855, and -.956,
respectively. But the correlation between 'nafter' and (mean - c_0)/n is also
significant, at -.489. Hence, we cannot state that lessons which were
quickly validated will produce less chance of misclassification. Since
the correlation of (mean - c_0)/n and nafter is -.489, which is
significantly high, the cutoffs c_0 associated with some of these Mastery
Validation Exams might have happened to be chosen closer to the means of
the corresponding exams. This fact raises a question about
the appropriateness of the validation criterion that has been used in the PLATO
Service Program at Chanute.

Table 8

The correlations of Block tests to MVE scores and mastery time

lesson   r(B, MVEs)   r(B, time)   r(G, MVEs)   r(G, time)
103         .15          -.22          .23         -.38*
104a        .38*         -.33*         .19         -.43*
104b        .36*         [?]           .44*        [?]
105         .22          -.08          .20         -.34*
201a        .34*          .12          .44*        -.05
201b        .19          -.25          .38*        [?]*
202a        .17          -.04          .07         -.43*
202b        .26          -.03          .28*        -.07
204         .21          -.21          .11         -.13
205a        .28*         -.24          .18         -.32*
205b        .25          -.08          .15         -.26
206a        .40*         -.21          .13         -.22
206b        .12          -.04         -.02         -.18
206c        .00          -.04          .33*        -.08
207         .28*         -.17          .25         -.27
301         .04          -.08         -.11         -.06
303         .34          -.21          .08         -.05
304         .38          -.27          .42*        -.37
305         .07          -.19          .31*        -.26
307         .30*         -.23          .41*        [?]*
308         .01           .04          .00         -.07
401         .50*         -.15          .32*        -.21
402         .25          -.14          .46*        -.34*
403         .40*         -.23          .21         -.02
404        -.02           .00          .02         -.33*
405a        .07           .01          .12         -.11
405b        .25          -.06          .17         -.12
405c        .37*         -.11          .19         -.07

* significant at p < .05.
[?] entry illegible in the source.

Table 9

A Correlation Matrix with Summary Description of Variables

     Variable           Description
 1   P(F+)              false positive
 2   c_0/n              theoretical cutoff divided by number of items
 3   α21                the ratio of true variance to observed variance
 4   P(F+) + P(F-)      probability of misclassification
 5   nafter             number of subjects using a lesson after it was
                        declared to be validated
 6   %fail              observed percentage of failure in MVE
 7   P(π > .9)          Bayesian estimate of success rate in the population
 8   range              maximum mastery time minus minimum mastery time
 9   r(G, MVEs)         correlation of gain to MVE scores
10   r(G, time)         correlation of mastery time to gain
11   r(B, MVEs)         correlation of blocktest to MVE scores
12   r(B, time)         correlation of blocktest to mastery time
13   items              number of items in a test
14   (mean - c_0)/n     relative distance of c_0 from the mean (c_0: observed)
15   P(F-)              false negative

          1        2        3
 1      1.000
 2       .250    1.000
 3      -.006     .358    1.000
 4       .931     .393    -.020
 5      -.562    -.373     .037
 6       .111     .167     .384
 7      -.211    -.156    -.347
 8       .265     .621     .213
 9      -.283    -.244     .090
10       .183    -.233    -.259
11      -.199     .051     .324
12       .027     .053    -.316
13      -.108    -.271     .079
14       .659     .510     .244
15       .638     .542     .079

          4        5        6        7
 4      1.000
 5      -.617    1.000
 6       .165     .335    1.000
 7      -.226    -.265    -.903    1.000
 8       .345    -.304     .206    -.113
 9      -.264     .271     .032     .053
10       .054    -.099    -.460     .386
11      -.102     .125     .28[?]  -.368
12      -.056    -.133    -.320     .355
13      -.211     .426     .385    -.339
14       .855    -.489     .408    -.396
15       .869    -.544     .293    -.281

          8        9       10       11
 8      1.000
 9      -.074    1.000
10      -.414    -.377    1.000
11      -.192     .403    -.275    1.000
12      -.120    -.235     .520    -.468
13       .070     .231    -.190    -.034
14       .415    -.119    -.193     .141
15       .417    -.196    -.171     .099

(Table 9 cont.)

         12       13       14       15
12      1.000
13      -.159    1.000
14    [?].228    -.119    1.000
15      -.176    -.264     .956    1.000

Note. All correlation values were transformed by Fisher's Z transformation.
Probabilities were transformed by sin^-1(√P).
[?] marks a character that is illegible in the source.
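The note to Table 9 mentions two standard variance-stabilizing transformations applied before the correlational analysis. A minimal sketch of both follows; reading the garbled note as the arcsine square-root transformation is our assumption, and the function names are hypothetical.

```python
from math import asin, atanh, sqrt


def fisher_z(r):
    """Fisher's Z transformation of a correlation coefficient, z = atanh(r)."""
    return atanh(r)


def arcsine_sqrt(p):
    """Arcsine square-root transformation of a probability or proportion."""
    return asin(sqrt(p))


# Example inputs: r(G, MVEs) for lesson 201a (Table 8) and P(F-) for MVE207.
z_example = fisher_z(0.44)
angle_example = arcsine_sqrt(0.1957)
print(round(z_example, 4), round(angle_example, 4))
```

Both transformations stretch values near the ends of their ranges, so that correlations and probabilities behave more like normally distributed variables in the subsequent regression analyses.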

A stepwise multiple regression procedure was performed on the
fifteen variables, and three predictors were selected to predict the
variable P(F+ or F-). Table 10 gives a summary of the analysis.

Table 10

Estimation of P(F+) + P(F-) by Stepwise Multiple Regression

variable          coefficient   S.D. error     t
α21                  -.193         .088       2.193 *
nafter               -.205         .098       2.092 *
r(G, time)            .144         .089       1.618
(mean - c_0)/n        .829         .102       8.127 **

Multiple R = .9101, constant = .60, F(3,23) = 30.305 **

* significant at p < .05   ** at p < .01


The first predictor, (mean - c_0)/n, for the criterion P(F+ or F-),
variable 4, has a beta coefficient of 0.792 and a t-value of 7.9 in the
significance test. This result is expected, but α21 entering as the second
predictor in the analysis is surprising. If α21 is high enough, then
the probability P(F+ or F-), the occurrence of misclassification, will be
minimized. Most Master Validation Exams have reliabilities of
around .4 to .5, which is quite low, so it is natural to expect that
misclassifications will have occurred quite frequently in the program.
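The t values reported in Table 10 are the ratios of each coefficient to its standard error, which can be verified directly. The sketch below simply recomputes those ratios from the tabled coefficients and S.D. errors; the dictionary layout and variable names are ours.

```python
# (coefficient, standard error, reported t) as listed in Table 10.
table10 = {
    "alpha21": (-0.193, 0.088, 2.193),
    "nafter": (-0.205, 0.098, 2.092),
    "r(G, time)": (0.144, 0.089, 1.618),
    "(mean - c0)/n": (0.829, 0.102, 8.127),
}

# The table reports absolute t values, so the sign of the coefficient is dropped.
t_values = {name: abs(coef) / se for name, (coef, se, _) in table10.items()}

for name, (coef, se, reported) in table10.items():
    print(f"{name:15s} t = {t_values[name]:.3f} (reported {reported})")
```

The recomputed ratios agree with the reported values to three decimals, and the dominant size of the ratio for (mean - c_0)/n reflects how strongly the distance of the cutoff from the mean drives misclassification.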

The variable α21 does not correlate significantly with variable 13, the
number of items in the tests; it correlates with variable 6, the percentage
of failure, at the 5% significance level. This relationship may be
interesting to investigate further, especially when the tests
are short and of about the same length, containing 10-15 items, as is customary in
criterion-referenced tests.

The following picture might help for a
quick, intuitive grasp of the relationship
between F+, F-, and the variables c_0, n, and μ_x.
The areas marked F+ and F- depend
on μ_x - c_0 and n - μ_x.

Table 11

Relationship between the optimal cutoff c_0/n and other variables

variable          coefficient   S.D. error     t
α21                   .296         .142       2.085 *
range                 .583         .141       4.135 **
no. of items         -.362         .139       2.604 *

Multiple R = .7528, constant = .56, F(3,23) = 10.027 **

* significant at p < .05   ** at p < .01


Table 11 gives the results of a stepwise multiple regression
analysis where the criterion is the optimal cutoff c_0 divided by n.
The entered predictors are variables 8, 13, and 3. t-tests of significance
for the beta coefficients indicate that all three variables are
significant at p < .05. Since variable 8 is the range of time (the
difference between the maximum time needed and the minimum time), the
longer the time span needed by students to master a lesson, the larger
the ratio of the optimum cutoff to the number of items will be. It
should be noted that the procedure of evaluating the optimal cutoff c_0
does not depend on the time needed to complete or master a lesson. But
if c/n is relatively higher, then there is more failure, both F- and
correct failure (B in Figure 5), resulting in a larger range in the mastery
time of a lesson. α21 is again among the predictors, and if α21 is
larger, then c/n becomes more affected by it. This analysis needs to be
refined further, since a better way to interpret the results should be
found.

Table 12

Relationship between r(G, MVEs) and other variables

variable          beta coefficient   S.D. error     t
c/n                  -.336              .181       1.856 x
P(π > .9)             .207              .190       1.089
r(G, time)           -.535              .193       2.772 *

Multiple R = .5430, constant = 0.27, F(3,23) = 3.206 *

* significant at p < .05   x significant at p < .10

Table 12 shows the results of a similar analysis, using the
correlation of gain scores with Mastery Validation Exam scores,
r(G, MVEs), as the criterion. A larger value of this variable means that
the gain score was non-negligibly affected by the Mastery Validation
Exam. We know from Table 10 that the MVE scores of lessons 104b, 201a,
201b, 206c, 304, 305, 307, 401, and 402 have significant correlations.
This analysis revealed that the correlation of mastery time with gain
scores contributes most significantly to predicting variable 9. Since
the mastery time of a lesson correlates highly with aptitude scores, as
shown in Table A of the Appendix, this result is expected.

The students affected most by the choice of cutoff scores are mediocre
students whose scores are near the cutoffs; they therefore tend to be
misclassified more often, in either the positive or the negative
direction. The fact that the beta coefficient of variable 2 is -.336
means that the smaller the value of c_0/n, the larger the contribution
to the gain will be; thus mediocre students have a greater chance of
repeating the lessons, since the observed cutoff c was set to 80% across
all MVEs, which is the true mastery level that was aimed for.
Table 13

Relationship between P(π ≥ .9) and other variables

variable          beta coefficient    S.D. error      t
α21                    -.152             .178          .854
r(G, MVEs)              .224             .185         1.211
r(G, time)              .305             .190         1.605
no. of items           -.344             .195         1.966 x
(mean - c_0)/n          .314             .199         1.954 x

Multiple R = .6503, constant = 1.09, F(5,21) = 3.077*

*significant at p<.05   x significant at p<.10

Table 13 shows the results of the analysis when the criterion is
variable 9, the probability P(π ≥ .9) that 90% or more of the students
in the population from which our sample was drawn will achieve the 80%
mastery level on the end-of-lesson test. Five predictors were selected
from among variables 1, 2, 3, 4, 8, 9, 10, 11, 12, 13, 14, and 15. The
variables nafter and % fail were omitted because P(π ≥ .9) was derived
from these two values in the sample. None of the beta coefficients was
significant at p<.05, but we might be able to say that P(π ≥ .9) depends
to some extent on the test length (beta = -.344, t = 1.97). Also, the
distance of the mean from the observed cutoff c_0 affects the value of
P(π ≥ .9), such that if the observed cutoff c_0 is considerably smaller
than the mean, then the success rate of the lesson becomes larger. This
means that the test was probably too easy in comparison with other
tests. This result confirms that the validation criterion used in the
PLATO AFB CBE program at Chanute Air Force Base depended excessively on
the characteristics of the test, the MVE; hence the method that was used
to assess the quality of lessons was inadequate. There is a great need
for the development of a method to validate lessons directly, without
depending entirely on the end-of-lesson tests.
SUMMARY AND DISCUSSION


The problem of setting a validation criterion for a given lesson is
important in practice, but it has never become a focus for educational
researchers, although the closely related topic of criterion-referenced
testing has been one of the most popular research targets in the past
few years. Both the sample binomial model and the Bayesian binomial
model (beta binomial model) were adopted to set a better validation
criterion for a given lesson, and the results from the latter model
matched our data better than did those from the former. Therefore,
predicting the future success rate of the lesson with the Bayesian
binomial model is recommended for setting a validation criterion when
(a) the information is limited to the failure (or success) rate on the
end-of-lesson test and (b) the author (or instructor) of the lesson has
a certain level of prior belief as to the extent to which his/her lesson
will be successful. If the scores of a test given at the end of a lesson
are available, then it is recommended that as much information from the
test performance as possible be used in setting a validation criterion
for the lesson. Applying the beta binomial model of criterion-referenced
testing, the estimated probability of the observed score X being larger
than the observed cutoff c will be a better validation criterion than
the success rate. In other words, the probability of mastery, that is,
of passing the criterion score c, will serve as a validation criterion
of the lesson.
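
The Bayesian binomial recommendation above can be illustrated with a
minimal Python sketch. It assumes that the author's prior belief about
the success rate π is expressed as a Beta(prior_a, prior_b)
distribution, so that after observing the numbers of passing and failing
students the posterior is again a beta distribution, and P(π ≥ .9) is
its upper tail area. The parameter names, the uniform default prior,
and the student counts are hypothetical, and the tail is obtained by
simple numerical integration rather than by whatever procedure was
actually used in the project.

```python
import math


def beta_tail(a, b, x0, steps=20000):
    """P(pi >= x0) for pi ~ Beta(a, b), by midpoint-rule integration."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = (1.0 - x0) / steps
    total = 0.0
    for i in range(steps):
        x = x0 + (i + 0.5) * h  # midpoint of the i-th subinterval
        log_pdf = log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
        total += math.exp(log_pdf)
    return total * h


def prob_success_rate_at_least(passed, failed, prior_a=1.0, prior_b=1.0,
                               target=0.9):
    """Posterior P(pi >= target) after observing passed/failed counts."""
    return beta_tail(prior_a + passed, prior_b + failed, target)


# Hypothetical lesson: 45 students passed and 3 failed, uniform prior.
print(prob_success_rate_at_least(passed=45, failed=3))
```

A lesson could then be regarded as validated when this probability
exceeds a threshold chosen in advance; the threshold itself is a policy
decision and is not supplied by the model.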

Of course, the decision of mastery or non-mastery must
theoretically be based on a student's true performance level and not on
the observed scores, but the true score will never be available in
practice. It is possible, however, to estimate the probability of the
true score being greater than or equal to a given true mastery level,
say, 80%. Unfortunately, we do not have any analytical method to
determine the most suitable true mastery level for a program.

The four kinds of probabilities, correct pass (A), correct fail (B),
false positive (F+), and false negative (F-), were calculated over 27
Mastery Validation Examinations (a) when the observed cutoff c (80%
correct) was used and (b) when the optimum cutoff c_0, which minimizes
misclassification of students, was used. The results indicate that even
if c_0 were used in the decision process, some tests would still show
substantially large numbers of misclassifications of both the false
positive and false negative types. Since it is interesting to
investigate why some tests showed as much as about 20% misclassification
while other tests showed very little, three stepwise multiple regression
analyses were used to select the predictors of P(F+), P(F-), and
P(F+ or F-) separately. The strongest common predictor was the distance
of c_0 from the mean of a test, which was what we expected. The second
common predictor was α21, the internal consistency of a
criterion-referenced test. As α21 increases to 1, all three criterion
variables get smaller; hence fewer misclassifications occur. This means
that the internal consistency of the items in a given test is important
in controlling false positive and false negative errors.
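
To make the four outcome probabilities concrete, the following Python
sketch computes them for a single test under a beta binomial model. It
assumes that true scores π follow a Beta(a, b) distribution, that the
observed score X on an n-item test is binomial given π, and that a
student is a true master when π is at least the mastery level (.8
here). The parameter values are hypothetical, and the joint
probabilities are obtained by numerical integration instead of a closed
form.

```python
import math


def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p, for 0 < p < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p)
                    + (b - 1) * math.log(1 - p))


def binom_tail(n, c, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, x) * p**x * (1 - p)**(n - x)
               for x in range(c, n + 1))


def classification_probs(n, c, a, b, mastery=0.8, steps=2000):
    """Return (correct pass, correct fail, false positive, false negative)."""
    h = 1.0 / steps
    cp = cf = fp = fn = 0.0
    for i in range(steps):
        p = (i + 0.5) * h               # midpoint of the true-score slice
        w = beta_pdf(p, a, b) * h       # probability mass of the slice
        pass_prob = binom_tail(n, c, p)
        if p >= mastery:                # true master
            cp += w * pass_prob
            fn += w * (1 - pass_prob)
        else:                           # true non-master
            fp += w * pass_prob
            cf += w * (1 - pass_prob)
    return cp, cf, fp, fn


# Hypothetical 10-item test with cutoff 8 (80% correct).
A, B, F_plus, F_minus = classification_probs(n=10, c=8, a=8.0, b=2.0)
print(f"A={A:.3f}  B={B:.3f}  F+={F_plus:.3f}  F-={F_minus:.3f}")
```

The total misclassification probability discussed above is then
P(F+) + P(F-) for the chosen cutoff.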

The optimum cutoffs c_0 for the Mastery Validation Exams are smaller
than or equal to the observed cutoffs c actually used in almost all
cases in the PLATO AFB CBE project. Therefore the probabilities of
false negatives associated with c_0 are smaller than or equal to those
associated with the observed cutoff c. But the probabilities of false
positives associated with c_0 tend to be larger than those associated
with c. Since we set the loss ratio to 1 in this case, the total
probability of misclassification is always minimized by using the
optimum cutoff c_0. P(F+) in some tests is eight times as large as
P(F-), while in others the former is only a few times larger.

Setting the most appropriate loss ratio will be a problem when Huynh's
method of evaluating the optimum cutoff is adopted. Also, his method is
more sensitive for smaller loss ratios than for larger ones, say,
Q = 10 to 20. Our data showed that many Mastery Validation Examinations
(the end-of-lesson tests) had the same optimal cutoff c_0 for loss
ratios between 8 and 20. If his intention was to control false positive
errors in the mastery/non-mastery decision for a linearly related
curriculum such as mathematics, then the applicability of the method in
educational settings will be a problem.
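
The dependence of the optimum cutoff on the loss ratio can be explored
with a short Python sketch. It is an illustration under stated
assumptions and not a reproduction of Huynh's procedure: true scores are
taken to follow a Beta(a, b) distribution, observed scores are binomial
given the true score, the loss ratio Q is read as the loss of a false
positive relative to a false negative, and the optimum cutoff is found
by scanning every possible cutoff for the smallest value of
Q * P(F+) + P(F-). All parameter values are hypothetical.

```python
import math


def error_curves(n, a, b, mastery=0.8, steps=2000):
    """Return lists fp[c] and fn[c] for every cutoff c = 0, ..., n + 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = 1.0 / steps
    below = [0.0] * (n + 1)  # P(X = x and true score below mastery)
    above = [0.0] * (n + 1)  # P(X = x and true score at or above mastery)
    for i in range(steps):
        p = (i + 0.5) * h
        w = h * math.exp(log_norm + (a - 1) * math.log(p)
                         + (b - 1) * math.log(1 - p))
        target = above if p >= mastery else below
        for x in range(n + 1):
            target[x] += w * math.comb(n, x) * p**x * (1 - p)**(n - x)
    fp = [sum(below[c:]) for c in range(n + 2)]  # passed, but non-master
    fn = [sum(above[:c]) for c in range(n + 2)]  # failed, but master
    return fp, fn


def optimum_cutoff(fp, fn, loss_ratio):
    """Cutoff minimizing loss_ratio * P(F+) + P(F-)."""
    return min(range(len(fp)), key=lambda c: loss_ratio * fp[c] + fn[c])


# Hypothetical 10-item test; scan a range of loss ratios.
fp, fn = error_curves(n=10, a=8.0, b=2.0)
for q in (1, 2, 4, 8, 12, 16, 20):
    print(f"Q = {q:2d}: optimum cutoff = {optimum_cutoff(fp, fn, q)}")
```

Because the cutoff can take only the integer values 0 through n + 1,
several neighboring loss ratios can share one optimum cutoff, which is
in line with the behavior reported above for loss ratios between 8 and
20.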
Table A

Correlations of Aptitude Scores with MVE Scores,
First Completion Time, Mastery Time, and Test Completion Time

Lesson    MVE scores      First completion    Mastery time      Test completion
                          time                                  time

103        .45*           -.39*               -.08*             -.32*
104a       .17            [illegible]*        [illegible]*      -.06
104b       (time data was lost)
105        .31*           -.42*               -.49*             -.32*
201a       .52*            .04                [illegible]       -.32*
201b       .16            -.42*               -.42*             -.33*
202a       .38*           -.12                -.25              -.10
202b       .34*           -.19                -.19              -.42*
204        .19            -.16                -.22              -.26
205a       .39*           -.38*               -.45*             -.32*
205b       .47*           -.00                [illegible]       -.20
206a       .42*           [illegible]         -.14              -.42*
206b       .27            -.25                [illegible]       -.22
206c       [illegible]    [illegible]         -.02              -.40*
207        .24            -.23                -.26              -.15
301        .24            -.03                -.13              -.34*
303        .10            -.39*               -.26              -.19
304        .60*           -.14                -.36*             -.51*
305        .17            -.35*               -.36*             -.45*
307        .52*           -.54*               -.59*             -.57*
308        .20            [illegible]         -.03              -.54*
401        .38*           -.41*               -.41*             -.39*
402        .47*           [illegible]         -.39*             -.39*
403        .48*           -.24                -.31*              .09
404        .10            -.27                -.27              [illegible]*
405a       .27            -.15                -.27               .12
405b       .05            [illegible]         -.19              -.05
405c       .31*           -.11                -.06              [illegible]

* p < .05
TABLE B

Description of Contents in the Lessons of Chanute

Lesson

103, 104a, 104b, 105, 201a, 201b, 202a, 202b, 203a, 203b, 203c,
205a, 205b, 206a, 206b, 206c, 207, 301, 303, 304, 305, 307, 308,
401, 402, 403, 404, 405a, 405b, 405c

Content

Principles of Gas Engine
Identification of Parts and Purpose of Gasoline Engine Compressor
Cooling System
Air and Exhaust System
Fundamentals of Electricity
Batteries
Electrical Schematics
Cranking Motors, DC Charging System
AC Charging System
Battery Ignition
Emission Control
Diesel Engines
Lighting System
Warning System
Clutches
Basic Hydraulics
Fluid Couplings/Torque Converters
U-Joints/Propeller Shafts
Differentials
Transfer Case/PTO
Suspension System
Hydraulic and Mechanical Brakes
Air Brakes
Power Assisted Brakes